A Character String-Based Stemming for Morphologically Derivative Languages
Abstract
Morphologically derivative languages form words by fusing stems and suffixes, and stems need to be extracted to support cross-lingual alignment and knowledge transfer. Because phonetic harmony and disharmony occur when linguistic particles are combined, both phonetic and morphological changes need to be analyzed. This paper proposes a multilingual stemming method that learns morpho-phonetic changes automatically, based on character embedding and sequential modeling. First, character feature embeddings at the sentence level are used as input, and a BiLSTM model obtains the forward and reverse context sequences. An attention mechanism is added to this model for weight learning, so that global feature information can be used to capture stem and affix boundaries. Finally, a CRF learns more information from the sequence features to describe the context more effectively. To verify the effectiveness of the model, it is compared with traditional models on two different data sets in three languages: Uyghur, Kazakh and Kirghiz. The experimental results show that the proposed model has the best stemming effect on the sentence-level datasets. In addition, it outperforms the other traditional models, takes the characteristics of the data fully into account, and has certain advantages with less human intervention.
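The abstract frames stemming as character-level sequence labeling with a CRF as the final layer. The following is a minimal sketch of that framing only, not the authors' implementation. The two-label scheme (`S` for stem characters, `A` for affix characters), the function names, the hand-written scores standing in for BiLSTM and attention outputs, and the Turkish toy word "kitaplar" (kitap + lar) are all assumptions made for illustration; none of them come from the paper.

```python
from typing import Dict, List, Tuple

STEM, AFFIX = "S", "A"  # hypothetical label scheme, not taken from the paper


def to_char_labels(stem: str, suffixes: List[str]) -> Tuple[List[str], List[str]]:
    """Turn a segmented word into parallel character and label sequences."""
    suffix_chars = [c for s in suffixes for c in s]
    chars = list(stem) + suffix_chars
    labels = [STEM] * len(stem) + [AFFIX] * len(suffix_chars)
    return chars, labels


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Highest-scoring label path, as a linear-chain CRF computes at inference."""
    labels = list(emissions[0])
    best = {lab: emissions[0][lab] for lab in labels}
    back: List[Dict[str, str]] = []
    for em in emissions[1:]:
        step: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for cur in labels:
            # Best previous label for reaching `cur` at this position.
            prev = max(labels, key=lambda p: best[p] + transitions[(p, cur)])
            step[cur] = best[prev] + transitions[(prev, cur)] + em[cur]
            ptr[cur] = prev
        back.append(ptr)
        best = step
    path = [max(labels, key=lambda lab: best[lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def stem_from_labels(chars: List[str], labels: List[str]) -> str:
    """Recover the stem as the leading run of STEM-labelled characters."""
    out = []
    for ch, lab in zip(chars, labels):
        if lab != STEM:
            break
        out.append(ch)
    return "".join(out)


# Toy example: made-up per-character scores in place of a trained encoder.
chars, gold = to_char_labels("kitap", ["lar"])
emissions = [{STEM: 1.0, AFFIX: 0.0}] * 5 + [
    {STEM: 0.0, AFFIX: 1.0},
    {STEM: 0.6, AFFIX: 0.4},  # noisy position: per-character argmax would pick S
    {STEM: 0.0, AFFIX: 1.0},
]
# Penalize returning to the stem after an affix has started.
transitions = {(STEM, STEM): 0.0, (STEM, AFFIX): 0.0,
               (AFFIX, AFFIX): 0.0, (AFFIX, STEM): -5.0}
predicted = viterbi(emissions, transitions)
predicted_stem = stem_from_labels(chars, predicted)
```

In this toy setup the transition penalty lets the decoder override the one noisy per-character score, which illustrates why a sequence-level layer can place the stem and affix boundary more consistently than independent per-character decisions.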
Similar resources
Statistical Stemming of Morphologically Rich Languages
We analyze current machine translation into Russian, a morphologically rich language, and present a technique for unsupervised statistical stemming. An initial pass is based on intuitions, and uses a CUDA kernel. As a later pass, we run EM. Since our model is relatively simple, the EM probabilities factorize into components that can be solved independently. While our results were not exactly gr...
Stemming Strategies for European Languages
In this paper, we describe and evaluate different general stemming approaches for the French, Portuguese (Brazilian), German and Hungarian languages. Based on the CLEF test-collections, we demonstrate that light stemming approaches are quite effective for the French, Portuguese and Hungarian languages, and perform reasonably well for the German language. Variations in mean average precision amo...
Character-based PSMT for Closely Related Languages
Translating unknown words between related languages using a character-based statistical machine translation model can be beneficial. In this paper, we describe a simple method to combine character-based models with standard word-based models to increase the coverage of a phrase-based SMT system. Using this approach, we can show a modest improvement when translating between Norwegian and Swedish...
Stemming Approaches for East European Languages
During this CLEF evaluation campaign, the first objective is to propose and evaluate various indexing and search strategies for the Czech language that will hopefully result in more effective retrieval than language-independent approaches (n-gram). Based on the stemming strategy we developed for other languages, we propose that for the Slavic language a light stemmer (inflectional only) and als...
Character Composition Model with Convolutional Neural Networks for Dependency Parsing on Morphologically Rich Languages
We present a transition-based dependency parser that uses a convolutional neural network to compose word representations from characters. The character composition model shows great improvement over the word-lookup model, especially for parsing agglutinative languages. These improvements are even better than using pre-trained word embeddings from extra data. On the SPMRL data sets, our system o...
Journal
Journal title: Information
Year: 2022
ISSN: 2078-2489
DOI: https://doi.org/10.3390/info13040170